1. INTRODUCTION


This report deals with both the modeling and measurement of fault-tolerant mul- 
tiprocessors. A detailed analysis of systems of this type is desired because of the increas- 
ing number of mission-critical situations in which they are used. One would like to be 
able to predict the performance of such systems for various workloads and how well they 
recover from system errors. The speed and effectiveness of the recovery procedures for a 
fault-tolerant multiprocessor have a direct effect on its performance. 

In the first part of this report we present a model to analyze the performance of a
unibus¹ multiprocessor. A closed queueing network is developed to study the effects of
workload variation on bus contention, processor utilization, and performance. This
development entails representing the computer system with a modified Stochastic Petri
Net (SPN), which aids in illustrating the operation of the specific system and determining
which factors have the most significant effect on performance.

A second component of this report pertains to the measuring of fault latency in a 
multiprocessor environment. This entails explicitly determining the distribution of fault 
latency and its significance in system modeling and analysis. The result of this research 
shows that fault latency is significant and that the common assumption of a negligible 
fault latency may be incorrect. 

An existing system, the Fault-Tolerant Multiprocessor (FTMP) located at the
NASA AIRLAB [17-20], is used as a modeling example. Many experiments have been
conducted on this system to measure fault latency and performance-related factors, such as
bus contention and idle processors. The results of some of these experiments
justify the conclusions drawn concerning fault latency.

¹ This unibus can consist of redundant buses which logically act as a unibus.

The rest of this report is organized as follows. Section 2 deals with the modeling of
fault-tolerant unibus multiprocessors and is divided into seven subsections. In Subsection 
2.1 the performance modeling is introduced. Subsection 2.2 describes the specific archi- 
tecture being addressed, a real-time unibus multiprocessor, and its operation. Subsec- 
tions 2.3 and 2.4 describe the SPN model and the closed queueing network model, 
respectively. The results of the queueing model and closed form solutions are presented 
in Subsection 2.5. The experimental system, FTMP, is described briefly in Subsection 
2.6. Subsection 2.7 shows the queueing model representation of FTMP and some meas- 
ured experimental results pertaining to its performance. 

In Section 3, we present the technique of characterizing fault latency, which is an 
important system parameter for modeling computer systems. Subsection 3.1 introduces 
the concept and approach to measuring fault latency. A methodology for measuring 
fault latency is outlined in Subsection 3.2. An example of the application of the method 
on FTMP is shown with experimental results in Subsection 3.3. Finally, the report con- 
cludes with Section 4. 

2. PERFORMANCE MODELING OF REAL-TIME MULTIPROCESSORS 
2.1. Introduction 

Representing the operation of a computer system by a structured model is a popu- 
lar and natural approach to the study of a computer’s performance. Many factors need 
to be incorporated into the model so that it accurately describes the system that is being 
modeled. The type of analysis desired dictates which factors of the computer’s operation 
need to be incorporated into the modeling framework. A factor that is almost always 
included, especially in the study of computer performance, is the representation of the 
workload handled by the computer system being analyzed. The workload is an essential
part of the performance evaluation of any computer system, because how well a com- 
puter performs is directly related to the type of workload it is handling. 

First, we present the beginning stages in the development of a model to study the
workload effects on performance for a specific computer architecture and application.
The type of system being addressed is a highly reliable unibus² multiprocessor that is
used in real-time control. Dealing exclusively with real-time systems in the evaluation of
multiprocessor performance is an approach that has not been widely addressed in the
literature. Usually, a general purpose multiprocessor is discussed, as in [1-3]. That
approach is difficult because of the widely varied workload that general purpose systems
handle. Representing a system of this type together with its workload becomes unreasonably
complex if one wants to properly describe the workload effects on performance. It
appears that a number of interesting results can be obtained if one considers only the
structure of a real-time system and its workload.

The detailed analysis of this type of system is desired because of the increasing
number of critical situations it is used for, e.g., control of aircraft, spacecraft, nuclear
reactors, etc., where the failure of the controlling computer would result in catastrophic
losses. A failure could be the result of a physical malfunction or the result of the system
not reacting as quickly as required [4].

Many authors have presented designs for synthetic workloads [5-8]. They have
usually relied on heuristic methods that seem to provide an adequate workload for a
general class of computing systems. Recently, Ferrari [9] has made the point that a more
systematic method is necessary, because of the fundamental correlation between workload
modeling and any performance evaluation. Developing such a method is more
complicated than it might first appear. One first needs to define what the workload model
should cover in its representation, and what standard should be used to determine if a
workload model is a "good" model.

² As mentioned earlier, this can consist of redundant buses.

We view a real-time computer system as the combination of two closely dependent
components: the controlled process and the controlling computer [4]. Because of this
close dependency, we feel that the development of a synthetic workload for this type of
system should rely not only on the actual workload being modeled, but also
on the type of system handling the workload. It is this basic association that
sets our work apart from that of others. Having a specialized synthetic workload of
this type provides us with a means for producing more useful results relating to the
performance evaluation of real-time computing systems.

Typically, the workload of a real-time system is a fixed group of tasks that have to
be performed repeatedly at certain intervals. There is usually a group of short,
frequently initiated tasks that monitor internal and external conditions and continually
compensate for their change. There are also tasks that are initiated less frequently and
require more computation time. The relative frequencies of task initiation and
the number of tasks that need to be completed in a certain time frame lead to strict
performance criteria.

It would be desirable to be able to determine if a computer system with the archi- 
tecture mentioned above could handle a given workload and set of performance criteria. 
If it can, one would like to know how this might be best accomplished. And finally, it 
would be useful if this optimal performance could be measured. The model presented 
here will hopefully aid in solving some of these problems. 

Vital factors can be determined directly from the model, such as the amount of
processor idle time, the degree of contention for the single bus, and the tasks that have 
the most significant effect on performance. These details will be discussed in a later sub- 
section. The model can also be used as a tool for determining the optimal workload dis- 
tribution to reach a certain level of performance. 

2.2. System Architecture and Operation 

As mentioned earlier, the hardware system addressed here is a highly reliable 
unibus multiprocessor. The general structure of such a system is shown in Figure 1. It
consists of four major components: processing clusters, input/output links, a time-shared
system bus, and system memory. Each of these is described below, along with
their assumed interdependencies.

A processing cluster is an entity that is capable of operating on one task at a time.
It consists of one or more pairs of a processing unit and its local memory. The degree of
redundancy is considered immaterial to the performance of the cluster for a given task,
although it does have a significant impact on the reliability and
configuration aspects of system operation. What is important is that regardless of how many
pairs there are in a cluster, they all work together on a single task. For example, a
cluster may represent a triple modular redundant (TMR) system of three processing units
and their local memories. It is also assumed that all the clusters in the system are of the
same type, i.e., they all contain the same number of processor-memory pairs.

An input/output link is a component that enables data to be transmitted to or from
an external device. Such links allow the system to read data from sensors and transmit data
to actuators and displays. They are also the channels used for human interface
through terminals or other similar devices.

[Figure 1. System Architecture. PC = processing cluster, LM = local memory, I/O = input/output link]

The time-shared system bus interconnects all the processing clusters, I/O links, and 
system memory. It is the medium for exchanging all data and control signals. Again, 
this bus may be redundant for reliability reasons, but only one cluster transmits and 
receives data over all copies of the bus at a time. Therefore, the redundant system bus 
logically acts as a unibus. A cluster communicating over the bus is said to control the 
bus. 

Finally, there exists a single system memory that is addressable over the system 
bus. This memory usually consists of a collection of dynamic RAMs. The system 
memory may be redundant with the restriction that only one system memory location 
may be addressed at a time. 

The basic operating principles of this multiprocessor system can be explained as
follows. All tasks to be executed by the system are stored in system memory. These tasks
can be divided into $n$ job classes, where a job class consists of tasks that are required to
repeatedly execute at the same relative frequency. More specifically, tasks of job class $i$
are executed every $r_i$ seconds, where $1/r_i$ is the frequency of initiation of a task of job
class $i$. There may be more than one job class having the same relative frequency for its
tasks. The set of job classes is a partition of the set of system tasks, i.e., each task is in
one and only one job class.

Each job class is given a priority. This priority is used to determine which
processing cluster may use the system bus when there is contention among clusters for bus
control. A cluster about to work on or currently working on a task from job class $i$ has
priority over another cluster to control the system bus if the other cluster is about to
work on or is currently working on a task from job class $j$, where $1 \le i < j \le n$.
Priority among clusters working on tasks of the same job class is determined by a first come
first served (FCFS) policy. Task queues are kept for each job class, and these also reside in
system memory.

An idle cluster wishing to process a task from job class $i$ must first gain control of
the system bus. It does this by waiting for inactivity on the bus and then participating
in a polling sequence. A polling sequence is a decision process to determine which
cluster has the highest priority. This is conveniently done by requiring each of the
clusters to transmit its priority number over the system bus and having a voting mechanism
determine which cluster has the highest priority. As a result of the polling sequence,
the cluster with the highest priority is given control of the bus.
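
The outcome of a polling sequence can be sketched in a few lines of Python. This is only an illustration of the priority rule stated in the text (lowest job-class number wins, FCFS among equal classes); the `PollRequest` record, its field names, and the `arrival` counter are assumptions introduced here and do not describe the actual voting hardware.

```python
from typing import NamedTuple


class PollRequest(NamedTuple):
    cluster: int    # hypothetical cluster identifier
    job_class: int  # 1 is the highest priority, n the lowest
    arrival: int    # order in which the request was made (assumed FCFS tie-break)


def poll_winner(requests):
    """Return the request that wins the poll.

    The lowest job-class number wins; among requests of the same class,
    the earliest arrival wins.
    """
    if not requests:
        raise ValueError("no cluster is requesting the bus")
    return min(requests, key=lambda r: (r.job_class, r.arrival))


requests = [
    PollRequest(cluster=1, job_class=3, arrival=0),
    PollRequest(cluster=2, job_class=1, arrival=2),
    PollRequest(cluster=3, job_class=1, arrival=1),
]
winner = poll_winner(requests)
```

Here clusters 2 and 3 both request class 1, so the earlier of the two requests (cluster 3) is granted the bus ahead of cluster 1, which requested a lower-priority class.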

At this time the cluster reads the task queue for job class $i$ from system memory
and determines the next task to be executed. It then reads in the task and all data 
necessary to process that task. This data can be obtained from I/O link reads or more 
system memory reads. After obtaining all the information necessary to internally exe- 
cute the task, the cluster updates the job queue in system memory and releases the bus. 
There are other mechanisms such as counters, queues, and interrupt timers to aid a clus- 
ter in determining which job class to request. When a cluster completes a task, it will 
again request bus control and transmit its results to the relevant addresses, determine 
which job class to work on next, and proceed as before. 

At any particular instant, all the clusters could be processing tasks simultaneously 
resulting in peak performance. Performance dwindles when a cluster becomes idle wait- 
ing for control of the system bus. There is also a penalty in performance, or system 
failure, if all the clusters are not able to keep up with the required frequency of task exe- 
cution for each job class. 

A reasonable question to address is how the job classes are formed. More specifically,
given a system workload, what is the best number of job classes and the distribution
of the tasks among these classes? For a general purpose computing system's workload,
this is difficult to determine [8]. Some of the main problems in representing the
workload in a general purpose multiprocessor system model are (1) showing the
interdependencies among tasks in the workload, (2) the fact that the workload may not be
stationary, i.e., tasks of one type might occur at different rates at different times, (3) the 
unlimited number of tasks possible, and (4) the contention for physical components 
needed to execute tasks operating concurrently. Providing a model that is able to 
represent all these features would be extremely difficult, if not impossible. Fortunately, 
when one only considers real-time applications on a unibus system these problems 
become relatively easier to address. The workload of a real-time system is usually a 
fixed set of tasks that have to be executed in a prescribed order at regular intervals. 
This makes determining the physical and logical interdependencies more tractable. It 
also implies a stationarity among the relative frequencies of different tasks. Therefore, 
natural job classes can be formed and parameterized. However, this still is not an easy 
task. 


2.3. Stochastic Petri Net Model 

In the development of the model, it was first necessary to represent the overall 
operation of the system at some level of abstraction that would be amenable to the type 
of performance analysis desired. This representation is needed to depict the various 
states a processing cluster might be in. The features that have a significant effect on 
performance are system bus contention, transmission delays, and possible idle periods of 
a processing cluster. By modeling at the system level, where the components of concern 
are the tasks, clusters, and system bus, we are able to describe the stages a processing
cluster will go through and how its actions affect the operation of the other clusters. 

A useful tool for showing synchronization among system components is a Stochastic
Petri Net (SPN) [10-11]. Figure 2 is an example of a modified stochastic Petri net which
describes the synchronous actions of the system referred to in this report. This is a
modified SPN because of the presence of the three function blocks F1, F2, and F3.

An SPN is a structure consisting of places, transitions, and directed arcs connecting
transitions and places. A place is usually represented in a net drawing as a circle, while 
transitions are shown with bars. Directed arcs connect these places and transitions in a 
way that there is no arc going directly from a place to another place, or from a transi- 
tion to a transition. Tokens or dot markings in a place represent collectively the state of 
the SPN. 

A transition will fire when it becomes enabled. A transition is enabled when there
exists at least one token in each input place to the transition. Firing a transition
removes one token from each input place for each arc entering the transition, and places
a single token in each place that has an input arc emanating from that transition.
A transition may fire instantaneously or after an exponentially distributed random
duration. Instantaneous transitions are represented by solid bars (T1 - T9 in Fig. 2);
transitions with a random duration are called timed transitions and are represented by
hollow vertical bars (T10 - T21). When an instantaneous transition is enabled, tokens are
immediately removed from input places and sent to output places. When a timed
transition is enabled, there is an exponentially distributed delay before tokens are
removed from input places and immediately sent to output places.
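
The enabling and firing rules just described can be sketched as follows. This is a minimal illustration, not the net of Figure 2: the place names `"A"`, `"B"`, `"C"`, the function names, and the representation of a marking as a dictionary are all assumptions made for the example.

```python
import random


def is_enabled(marking, inputs):
    """A transition is enabled when each input place holds at least one
    token per arc leading from that place into the transition."""
    return all(marking.get(p, 0) >= inputs.count(p) for p in set(inputs))


def fire(marking, inputs, outputs):
    """Return the marking after firing; the given marking is not modified.

    `inputs` and `outputs` list one place name per arc.
    """
    if not is_enabled(marking, inputs):
        raise ValueError("transition is not enabled")
    new_marking = dict(marking)
    for place in inputs:
        new_marking[place] -= 1  # one token removed per input arc
    for place in outputs:
        new_marking[place] = new_marking.get(place, 0) + 1  # one token per output arc
    return new_marking


def timed_delay(rate, rng=random):
    """Sample the exponentially distributed delay of a timed transition."""
    return rng.expovariate(rate)


# Hypothetical two-input, one-output transition.
before = {"A": 1, "B": 1, "C": 0}
after = fire(before, inputs=["A", "B"], outputs=["C"])
```

An instantaneous transition corresponds to calling `fire` as soon as `is_enabled` holds; for a timed transition a simulator would first wait for `timed_delay(rate)` time units.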

The function blocks in Figure 2 are not defined components of a true SPN. They
are used here to simplify the appearance of the figure. The functions represented by
each block have been expressed as SPNs themselves. They act on the input arcs to the
block and produce tokens at the output arcs. The functions they represent are simple in
nature, but their SPN representations are complex and obscure that simplicity. For example,
F3 has been expressed using 13 places, 11 transitions, and 46 directed arcs.

Figure 2 is an SPN for a three cluster, unibus multiprocessor. How a single cluster
is incorporated in this model will now be explained; the extension from one cluster to
three or more is then simple to envision. There are seven places (P1, P2, P7, P8, P13,
P18, and P21), three instantaneous transitions (T1, T4, and T5), and four timed
transitions (T10, T11, T16, and T21) necessary to describe the operations of a single cluster.
What they represent is described in Table 1 and Table 2.

The F1 (Poll) function block is activated whenever there is a token present in
places 8, 10, or 12, i.e., when there is a poll request. It removes
these tokens if they are present and decides which of the requesting clusters should
obtain control of the bus. On output, one token is placed in place 13, 14, or
15, depending on which cluster has gained control of the bus. A token is also
placed in place 7, 9, or 11 for each requesting cluster that has lost the poll sequence.
For example, suppose cluster 1 and cluster 2 both initiate a poll sequence and cluster 1
is to succeed in the poll. Initially, there would be a token in each of places 8 and 10,
indicating that cluster 1 (place 8) and cluster 2 (place 10) wish to initiate a poll
sequence. The Poll function would remove these tokens and, after a delay representing the
time it takes to perform a poll, place a token in place 13 (cluster 1 has succeeded) and
one in place 9 (cluster 2 has lost the poll).
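
The token movement performed by the Poll block in this example can be sketched as a small function. The place numbers are those given in the text; the dictionary names, the function name, and the choice to pass the winning cluster in as an argument are assumptions of this sketch, and the polling delay is deliberately left out.

```python
# Place numbers taken from the text, keyed by cluster number.
REQUEST = {1: 8, 2: 10, 3: 12}   # cluster is initiating a poll sequence
GRANTED = {1: 13, 2: 14, 3: 15}  # cluster has won the poll
LOST = {1: 7, 2: 9, 3: 11}       # cluster has lost the poll


def poll_block(marking, winner):
    """Apply the Poll block to a marking (place number -> token count).

    `winner` is the cluster selected by the priority rule; how it is
    selected is outside the scope of this sketch.
    """
    requesting = [c for c, place in REQUEST.items() if marking.get(place, 0) > 0]
    if winner not in requesting:
        raise ValueError("the winner must be one of the requesting clusters")
    new_marking = dict(marking)
    for cluster in requesting:
        new_marking[REQUEST[cluster]] -= 1  # consume the poll request
        target = GRANTED[cluster] if cluster == winner else LOST[cluster]
        new_marking[target] = new_marking.get(target, 0) + 1
    return new_marking


# Clusters 1 and 2 request the bus and cluster 1 succeeds.
result = poll_block({8: 1, 10: 1}, winner=1)
```

After the call, the request tokens in places 8 and 10 are gone, place 13 holds the grant for cluster 1, and place 9 records that cluster 2 lost.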

Place   A token in this place means that ...

P1      a system bus request has been made by the cluster.
P2      the system bus is free as seen by the cluster.
P7      the cluster has lost a poll sequence.
P8      the cluster is initiating a poll sequence.
P13     the cluster has succeeded in a poll sequence and has been granted
        bus control.
P18     the cluster has completed its bus transactions and is to become idle.
P21     the cluster is ready to begin processing a task.

Table 1. Place Descriptions

Transition   The firing of this transition represents ...

T1           the cluster determining that the bus is busy.
T4           the cluster acknowledges that it has lost a poll sequence
             and must wait to make another request for the system bus.
T8           the cluster initiating a poll sequence.
T10          the cluster transmitting on the system bus.
T11          the cluster transmitting on the system bus.
T16          the cluster remaining in an idle state.
T21          the cluster internally executing a task.

Table 2. Transition Descriptions

Functions F2 (Bus Release) and F3 (Disable) act to indicate that the bus has
become free or is busy. Function F2 acts by keeping track of which clusters are in a poll 
sequence, thus transmitting on the system bus, or which are communicating over the 
system bus. When all activity is completed by all the relevant clusters, the F2 function 
will indicate that the bus is free by placing a token in places 2, 4, and 6. Function F3 
acts to disable other clusters from initiating a poll sequence if the bus is currently busy. 
Therefore, when a poll request is made the F3 function will determine which of the clus- 
ters should be disabled. 

Figure 2 completely describes the system we are interested in. One is able to follow 
the actions of a single processing cluster and observe the effects of these actions on the 
rest of the system. The model serves the purpose of enabling us to see which actions of 
a computing cluster have the greatest effect on system performance. For example, by 
supplying transition rates for the timed transitions, one could determine how often a bus 
request is made. Combining this with information on the duration of a typical transmis- 
sion will give us an idea of how often the bus is busy. With this result, it can be intui- 
tively stated that the higher the bus request frequency is, the greater the possibility of 
bus contention. 
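
As a rough illustration of this reasoning, suppose bus transactions complete at an average rate of $\lambda$ per second and each occupies the bus for an average of $d$ seconds. Both symbols are introduced here only for illustration and are not parameters of Figure 2. The utilization law then gives the long-run fraction of time the bus is busy:

```latex
\rho = \lambda \, d , \qquad 0 \le \rho \le 1 .
```

With purely illustrative numbers, 200 transactions per second at 2 ms each would give $\rho = 0.4$, i.e., the bus would be busy 40% of the time, and a cluster issuing a request would find the bus occupied correspondingly often.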

It can be observed that this model, were it completely expressed with valid SPN
components, would be cumbersome and confusing. Molloy [10] has shown that SPNs are
isomorphic to continuous parameter Markov chains. An SPN can therefore be converted to a
Markov chain and completely analyzed. One drawback of this method is that the state
space for such a Markov chain is large. It is unmanageably large for the example of
Figure 2 (keep in mind that the SPN for a function block is larger than the rest of the
model shown). Using this model directly as a tool for performance evaluation of the
system of interest is therefore impractical. A simpler model has to
be derived that expresses the same relationships. Such a model is introduced in the next
section.

2.4. Queueing Model Description 

The model presented in this section is designed to represent the states of a unibus
multiprocessor system. The state of the system is defined by a combination of the states
of all the processing clusters. With the aid of the model outlined in the previous section,
the states determined to be relevant to system performance are those in which a
processing cluster is (1) competing in a poll sequence, (2) communicating on the system bus,
(3) processing a task from job class $i$, or (4) idle, i.e., not processing a task. The
relationship between these states of a processing cluster and the relationship between
clusters can be inferred from Figure 2.

These relationships are incorporated into the closed queueing network shown in
Figure 3. This model has a number of advantages over the SPN model, besides the obvious
one of being simpler to understand. First, it reduces all the actions of bus contention
and the polling sequence into a single non-preemptive priority queue. A non-preemptive
priority queue is one where each of the arriving customers has an associated priority. A
customer entering the queue will move ahead of all the customers in the queue that have
lower priorities, and behind those of equal or higher priority. In this manner, customers
of the highest priority in the queue are served first on an FCFS basis. The second
advantage is that the separate job classes can be explicitly parameterized in this model,
whereas in the SPN model they were all grouped together. Third and most importantly,
this model can be easily solved for a given set of parameters.
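
The queueing discipline described above can be sketched with a binary heap. This is one common way to realize such a queue and is offered only as an illustration; the class name, the method names, and the use of an insertion counter to obtain FCFS order within a priority level are choices made here.

```python
import heapq
import itertools


class PriorityFCFSQueue:
    """Waiting line ordered by priority, FCFS within a priority level.

    A lower number means a higher priority. Only waiting customers are
    held here; a customer already in service is never displaced, which
    is what makes the discipline non-preemptive.
    """

    def __init__(self):
        self._heap = []
        self._arrivals = itertools.count()  # tie-breaker preserving arrival order

    def push(self, priority, customer):
        heapq.heappush(self._heap, (priority, next(self._arrivals), customer))

    def pop(self):
        """Remove and return the next customer to be served."""
        _priority, _order, customer = heapq.heappop(self._heap)
        return customer

    def __len__(self):
        return len(self._heap)


queue = PriorityFCFSQueue()
queue.push(2, "cluster A")
queue.push(1, "cluster B")
queue.push(2, "cluster C")
queue.push(1, "cluster D")
service_order = [queue.pop() for _ in range(len(queue))]
```

In this usage the two priority-1 customers are served first in their order of arrival, followed by the two priority-2 customers in theirs.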


[Figure 3. Queueing Model]

Before describing the details of this queueing model, it should first be noted that 
the parameters and node representations of this model differ from those of conventional 
queueing models. Typically, the nodes of a queueing model represent servers of some 
type, e.g., processors, workers, etc. The associated parameter for each node usually 
describes the exponential service rate for the server. The tokens or markings moving 
about the model represent customers that desire service, e.g., programs, jobs, etc. The 
actions of a closed queueing model can be described as a token arriving at a node, waiting,
if necessary, a certain length of time for service, being served for a length of time,
and moving on to the next node. The model described here reverses the conventional 
meanings of node and token. In this model, a node represents a customer that needs ser- 
vice, and the associated exponential service rate describes how long it takes to complete 
that service. The tokens on the other hand represent servers, where all the servers are 
identical. Therefore, this model represents servers moving from customer to customer 
and performing the service requested by that customer. This unorthodox convention is 
used because (1) it simplifies the model, and (2) it explicitly shows the state the system 
is in by showing what state each processing cluster is in. 

The goal is to determine the steady state probabilities for the distribution of clusters
among the different states.³ Since it is safe to assume that the system will reach
steady state before a cluster fails, the number of clusters remains constant in the
analysis. Typical values for the mean time between failures (MTBF) are on the order of
$10^3$ to $10^4$ hours, whereas steady state can be reached in a matter of minutes at most.

³ This will be shown in Section 2.5.

Once steady state is reached, a cluster may fail. At that point we have a system
with one less cluster, and it is reasonable to assume that this system will reach steady
state before another failure occurs. The performance of this degraded system will be less
than that of the previous system. To obtain the overall performance of the system
operating over a certain length of time, the performance contributions of each of the
configurations are combined, weighted by their relative time of operation. Therefore, in
the following analysis, we will assume that no cluster fails and that the number of
clusters remains constant.
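
The time-weighted combination of configurations can be sketched as a weighted average. The function name, the (performance, duration) pair representation, and the numbers in the usage lines are illustrative assumptions, not measured values.

```python
def overall_performance(configurations):
    """Combine per-configuration performance, weighted by time of operation.

    `configurations` is a sequence of (performance, duration) pairs, one per
    system configuration (for example m clusters, then m - 1 clusters).
    """
    configurations = list(configurations)
    total_time = sum(duration for _performance, duration in configurations)
    if total_time <= 0:
        raise ValueError("total time of operation must be positive")
    weighted = sum(performance * duration for performance, duration in configurations)
    return weighted / total_time


# Hypothetical mission: 900 hours at full performance, then 100 hours degraded.
mission = [(1.0, 900.0), (0.7, 100.0)]
combined = overall_performance(mission)
```

Because the steady state analysis is done separately for each configuration, only the per-configuration results and their durations are needed at this step.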

In this model m equals the total number of homogeneous clusters in the system. 
Since the number of tokens in this closed queueing model remains constant, it is justifi- 
able to have each token represent a cluster. Therefore, there are exactly m tokens 
present in the system at all times. The nodes represent the activities that are performed
by a cluster, e.g., a token at the idle node represents a cluster that is idle.

There are n + 2 nodes in this model. Again, n is the number of different job 
classes in the workload. As stated before, tasks that belong to the same job class are 
assumed to each have the same distribution of internal processing time. It is assumed 
that this processing time is an exponentially distributed random variable. The number 
of tasks in a job class has to be greater than or equal to one. Each of these job classes is 
given a priority level, where all tasks of the same job class have the same priority and a
task from class $i$ has priority over a task of class $j$ when $1 \le i < j \le n$.

Each of the nodes will be described below. 

NODE 1 : This node represents the transmission activity over the system bus. It consists
of a non-preemptive priority queue and a transmission server. A token at this node
represents a cluster that is either waiting to transmit on the system bus or currently
transmitting. The parameter $\mu_s$ describes the transmission rate of a cluster, i.e.,
$1/\mu_s$ is the average transmission duration. A non-preemptive priority queue is used to
show that a cluster that has just completed a task from class $i$ is given priority to
transmit over a cluster that has completed a task of class $j$, where again
$1 \le i < j \le n$. Clusters completing tasks of the same class are able to transmit on
an FCFS basis.

NODE 2 : A token at this node represents a cluster that is idle, i.e., performing no
useful computations. It is a multiserver node with $m$ servers. A node of this
type is used to indicate that all the clusters may be served at this node with
no queue forming. This is equivalent to saying that all the clusters may be
idle at the same time. The sojourn time in this idle state for a cluster is
assumed to be exponentially distributed with rate $\mu_I$. The rate at which
clusters leave this node is $k\mu_I$, where $k$ is the number of tokens being served by
the node.

NODES 3 through n+2 : These $n$ nodes represent the different job classes. Node
$i+2$ represents a processing activity on a task of class $i$. Again, as with
node 2, these are multiserver nodes with $m$ servers. Thus, no queue forms at
any of the nodes. This type of node is used to indicate that all the clusters
could be working on tasks from the same job class. The parameter $\mu_i$ is the
rate describing the processing duration of a task of class $i$. Typically,
$\mu_i > \mu_j$ when $i < j$. The rate at which clusters leave node $i+2$ is $k\mu_i$,
where $k$ is the number of tokens being served by the particular node.

The final parameters in the model that need explanation are the branch probabilities.
When a cluster completes a transmission, it either drops into the idle state or
continues processing. The probability that the next state is the idle state is $P_I$ and,
similarly, the probability that the next state is a processing state is $P_p$. Obviously,
$P_I + P_p = 1$. When a processor is to enter a processing state, there is the probability
$P_i$ of it being the processing of a job of class $i$, where $\sum_{i=1}^{n} P_i = 1$.
Typically, $P_i > P_j$ when $i < j$.
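
The rates and branch probabilities above determine how the count of clusters at each node can change. The sketch below computes the outgoing transition rates from one such count vector. It assumes that a cluster leaving the idle node or a processing node always proceeds to node 1 (the bus), as the operating description suggests; since Figure 3 is not reproduced here, that routing, the function name, and all parameter names are assumptions of this sketch.

```python
def outgoing_rates(state, mu_s, mu_idle, mu, p_idle, p_class):
    """Return {next_state: rate} for one count vector (a_1, ..., a_{n+2}).

    state   : clusters at node 1 (bus), node 2 (idle), nodes 3..n+2 (classes)
    mu_s    : bus transmission rate (single server)
    mu_idle : rate of leaving the idle node, per idle cluster
    mu      : list of n processing rates, one per job class
    p_idle  : probability of going idle after a transmission
    p_class : list of n class probabilities, summing to 1
    """
    n = len(mu)
    if len(state) != n + 2 or len(p_class) != n:
        raise ValueError("state and parameter lengths are inconsistent")
    p_proc = 1.0 - p_idle
    rates = {}

    def move(src, dst, rate):
        if rate <= 0:
            return
        nxt = list(state)
        nxt[src] -= 1
        nxt[dst] += 1
        rates[tuple(nxt)] = rates.get(tuple(nxt), 0.0) + rate

    if state[0] > 0:  # only one cluster transmits at a time
        move(0, 1, mu_s * p_idle)
        for i in range(n):
            move(0, i + 2, mu_s * p_proc * p_class[i])
    move(1, 0, state[1] * mu_idle)  # multiserver: k idle clusters leave at k * rate
    for i in range(n):
        move(i + 2, 0, state[i + 2] * mu[i])
    return rates


# One job class, three clusters: one on the bus, one idle, one processing.
example = outgoing_rates((1, 1, 1), mu_s=10.0, mu_idle=2.0, mu=[5.0],
                         p_idle=0.25, p_class=[1.0])
```

Note that the priority order at node 1 does not appear in these rates: with a common transmission rate and class-independent branching, the counts evolve the same way whichever waiting cluster is served next.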


2.5. Solutions to the Queueing Model 

The common approach to solving for the steady state probabilities of a queueing 
model is to convert the model to that of a continuous parameter Markov chain [12]. This 
approach will be used to solve the queueing model presented here. For the construction 
of a Markov chain, we make the following definitions. 

Definition 1 : A cluster state is a pair $(c_i, n_i)$, where $c_i \in \{1, 2, \ldots, m\}$ is a number
labeling a particular processing cluster, and $n_i \in \{1, 2, \ldots, n+2\}$ is the number of
the node where the token representing the cluster is located. There are
$m \cdot (n+2)$ cluster states.

Definition 2 : A system state is an $m$-tuple $(s_1, s_2, \ldots, s_m) \in S_1 \times S_2 \times \cdots \times S_m$,
where $S_i$ is the set of all cluster states whose first component is $c_i$. There
are $(n+2)^m$ system states.

An example of a system state for a system with three clusters and three job classes
is ((1,1), (2,3), (3,1)). This represents the configuration when clusters 1 and 3 are waiting
to communicate on the system bus or are currently communicating, and cluster 2 is
processing a task from job class 1.

From an analysis standpoint, a system state contains more information than is
necessary. We are only concerned with how many clusters there are at a particular
node. We do not need to know which they are, because they all require the same
amount of time to process the task at a particular node. It is the number of clusters
that determines how fast tasks are completed or delayed at a node. This motivates the
following definition.

Definition 3: A reduced system state is the $(n+2)$-tuple
$(a_1, a_2, \ldots, a_{n+2})$, where $a_i \in \{0, 1, \ldots, m\}$ is the total
number of tokens representing clusters at node $i$.

Since the $a_i$ must sum to $m$, there are $\binom{m+n+1}{n+1}$ reduced system states.

We can define a formal mapping, $\phi$, from a system state to a reduced system
state as follows: $\phi(s_1, s_2, \ldots, s_m) = (a_1, a_2, \ldots, a_{n+2})$,
where $a_i$ is the number of $s_j$'s whose second component is $i$. Referring to
the example above, we note that the system state ((1,1), (2,3), (3,1)) is
represented by the reduced system state (2,0,1,0,0). It should also be noted that
the system states ((1,1), (2,3), (3,1)), ((1,1), (2,1), (3,3)), and
((1,3), (2,1), (3,1)) are all represented by the same reduced system state.
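
The mapping from system states to reduced system states can be sketched in a few
lines of Python. This is an illustration only; the function name `reduce_state`
and the tuple encoding of states are assumptions made for the example, not
notation from the report.

```python
from collections import Counter


def reduce_state(system_state, num_nodes):
    """Map a system state ((c_1, n_1), ..., (c_m, n_m)) to its reduced state.

    Entry i-1 of the result is the number of clusters whose token is at node i.
    """
    counts = Counter(node for _cluster, node in system_state)
    return tuple(counts.get(node, 0) for node in range(1, num_nodes + 1))


# Three clusters and three job classes give n + 2 = 5 nodes.
example = ((1, 1), (2, 3), (3, 1))
print(reduce_state(example, 5))
```

All three system states listed in the text collapse to the same tuple, since
only the node occupancies are counted and the cluster labels are discarded.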

We use the reduced system states as the states of the Markov chain. The
transitions between these states are defined by the relevant service rates of
each of the nodes in the closed queueing network. It has been stated by
Kleinrock [13] that a closed queueing model of this type with $K$ customers and
$N$ nodes has $J = \binom{N+K-1}{N-1}$ states in its Markov chain
representation. For our model, we have $m$ customers (clusters) and $n+2$
nodes, so $J = \binom{m+n+1}{n+1}$. From this Markov chain, a $J \times J$
transition rate matrix, $A$, can be formed and used to derive the steady state
probabilities for each state in the Markov chain. This involves solving the
matrix equation $Ax = 0$, where $x = (x_1, x_2, \ldots, x_J)$ and $x_i$
represents the steady state probability of the system being in state $i$. A
nontrivial solution results when the constraint $\sum_{i=1}^{J} x_i = 1$ is
considered. The existence of such a solution is based on the fact that we have
constructed a finite state, irreducible, and recurrent Markov chain.^4 Since it
is possible for a token in the queueing model to move from one node to any other
node, either directly or through some intermediate nodes, and there is a
non-zero probability that a token leaving a node will return to that node, the
Markov chain is indeed irreducible and recurrent.
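
As a minimal sketch of this solution step, the code below solves the balance
equations together with the normalisation constraint for a small chain. The
report does not say which numerical method was used; Gauss-Jordan elimination
and the function name `steady_state` are assumptions for illustration, and the
three-state chain is a toy example, not the FTMP model.

```python
def steady_state(Q):
    """Solve pi * Q = 0 with sum(pi) = 1 for a small irreducible chain.

    Q[i][j] is the transition rate from state i to state j (i != j); each
    diagonal entry is minus the sum of the other entries in its row.
    """
    n = len(Q)
    # Row i of A is the balance equation for state i (column i of Q).
    A = [[Q[j][i] for j in range(n)] for i in range(n)]
    b = [0.0] * n
    # One balance equation is redundant; replace it by the normalisation.
    A[-1] = [1.0] * n
    b[-1] = 1.0
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(n):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
                b[r] -= f * b[col]
    return [b[i] / A[i][i] for i in range(n)]


# Toy cyclic chain 0 -> 1 -> 2 -> 0 with rates 1, 2, 3.
Q = [[-1.0, 1.0, 0.0],
     [0.0, -2.0, 2.0],
     [3.0, 0.0, -3.0]]
pi = steady_state(Q)
print([round(p, 4) for p in pi])
```

For the cyclic toy chain the time spent in each state is inversely proportional
to its exit rate, so the probabilities are in the ratio 6 : 3 : 2.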

Once the steady state probabilities are determined, two useful results concerning 
the multiprocessor system can be quickly obtained. One is the probability that a cluster 
is idle. This is simply the sum of the probabilities for each of the Markov chain states 
that represent having one or more clusters at node 2. The other result is the amount of 
system bus contention. When there is more than one cluster at node 1, there is a cluster 
waiting to obtain bus control. Again, all that has to be done is to sum the probabilities 
for each of the Markov chain states that represent having more than one cluster at node 
1. Recall that node 1 includes both the priority queue and the transmission server. 
These two results are necessary to produce a performance measure of any type. 

A third result can also be easily obtained. It would be interesting to know how
long a cluster would have to wait, on the average, if there is contention for the system
bus. It has been shown by a number of authors that the average queueing time for
customers of a given priority class in a non-preemptive priority queue can be determined
[14-16]. The average queueing time for a customer of priority class $i$ is



where

$k$ = the number of priority classes.

$a_j$ = the probability that an arriving customer is of class $j$.

$\mu_j$ = the mean service rate of a customer of class $j$.

$c_j$ = the second moment of the service-time distribution for customers of
class $j$.

^4 A unique steady state solution exists for this type of Markov chain.

The mean queueing time of all customers is $W_q = \sum_{j=1}^{k} a_j W_j$.

For the model described here, all clusters requesting service at node 1 require the
same amount of service time. Therefore, for this example we have $k = n$,
$\mu_i = \mu_s$ for all $i$, and $c_i = 2/\mu_s^2$ for all $i$. We then arrive
at the average queueing time for a cluster about to work on a task from job
class $i$, $W_i$.



It should be noted that $W_i$ is the average queueing time only. The total time a
customer spends at node 1 is the sum of the queueing time and the service time.
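
For orientation, the sketch below evaluates the standard mean waiting time for a
non-preemptive priority M/G/1 queue, which is assumed here to be the result the
text refers to. The total arrival rate `lam` is an assumed symbol that does not
appear in the definitions above, classes are indexed from 0 (highest priority),
and the formula presumes Poisson arrivals, so it should be read as an
approximation when applied to a closed network.

```python
def priority_wait_times(lam, a, mu, c):
    """Mean queueing time per class in a non-preemptive priority M/G/1 queue.

    lam  : total arrival rate (assumed symbol, not defined in the report text)
    a[j] : probability that an arriving customer is of class j
    mu[j]: mean service rate of class j
    c[j] : second moment of the class-j service-time distribution
    """
    # Mean residual service time found by an arriving customer.
    w0 = 0.5 * lam * sum(aj * cj for aj, cj in zip(a, c))
    waits, sigma_prev = [], 0.0
    for aj, muj in zip(a, mu):
        sigma = sigma_prev + lam * aj / muj  # load of this and higher classes
        waits.append(w0 / ((1.0 - sigma_prev) * (1.0 - sigma)))
        sigma_prev = sigma
    return waits


# Identical exponential service for every class: mu_j = mu_s, c_j = 2 / mu_s**2.
mu_s = 1.0
a = [0.5, 0.5]
waits = priority_wait_times(0.5, a, [mu_s, mu_s], [2 / mu_s**2, 2 / mu_s**2])
mean_wait = sum(aj * wj for aj, wj in zip(a, waits))
print(waits, mean_wait)
```

With identical service-time distributions the weighted mean of the per-class
waits equals the ordinary single-class waiting time, which gives a quick
consistency check on the per-class values.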

The only difficult part of deriving $W_i$ is determining the values of each of
the $a_i$'s. To do this, let $p(s)$ equal the steady state probability of being
in state $s$ of the Markov chain. Let $S_{ij}$ be the set of states of the
Markov chain representing $j$ clusters at node $i$. The rest of the clusters, if
any, may be at any of the remaining nodes. Then,

$$a_i = \frac{r_i}{\sum_{j=1}^{n} r_j}, \qquad \text{where } r_i = \mu_i \sum_{j=1}^{m} \sum_{s \in S_{ij}} j \, p(s).$$

2.6. Description of Experimental System: FTMP

FTMP is a highly reliable multiprocessor installed in the AIRLAB at NASA Langley
Research Center. This machine is intended to be used for real-time control of
commercial aircraft of the next decade. Because of the disastrous effects that
could occur if this computer should fail while in use, NASA has determined that
the probability that this system could fail should be less than $10^{-9}$ for a
10 hour flight. This obviously calls for extremely rigid performance criteria.

The hardware structure of FTMP consists of ten identical Line Replaceable Units 
(LRU’s)[17]. Each LRU includes a processor module which contains local cache memory, 
a shared 16K word memory, a 1553 I/O port, two bus guardian units, a clock generator, 
and a power subsystem. Any three processors can be grouped together into a triad. The 
processor remaining after forming three triads is reserved as a spare processor. Ten 
memory modules are also formed into three triads and a spare. Communications 
between processors and the shared memory are accomplished through serial system 
buses: that is, a data transmit bus(T-bus), a data receive bus(R-bus), and a polling bus 
(P-bus) for resolving bus contention. The system buses are also arranged as triads by 
activating three out of five. Therefore, from the programmer’s viewpoint, there is only 
one system bus. 

System configurations are controlled by bus guardians which assign the connections 
between processors and the P-bus or T-bus, and between shared memory and the R-bus. 
Two bus guardians at each LRU form a dyad such that any transmission to system 
buses will be enabled only when both guardians agree. The bus guardians are also used 
as a voter for any processor or memory triad. Since three processors in one triad are 
operating in tight synchrony, their respective bus guardians should receive three identi- 
cal data under a fault-free condition. When there is a disagreement, an error is con- 
sidered to have occurred, but masked, and the task execution will continue. Meanwhile, 
the disagreement will be recorded at an error latch for later identification of the faulty 
module or bus. From the user's or software's standpoint, the FTMP is regarded as a
three processor system and has a shared 48K system memory among the three as shown
in Figure 4. The interested reader is referred to [17] for a complete architectural
description of FTMP. What has been stated here is sufficient for the present discussion.

The software of FTMP is divided into five groups. They are the Executive
Software, Facilities Software, Acceptance Test/Diagnostic Software, Applications
Software, and Support Software [18]. Most of the tasks in each of these software groups
have to be dispatched at regular intervals to handle repetitive applications such as flight
control, configuration control, fault detection, recovery, as well as system displays. To
do this, FTMP has a dispatch algorithm that initiates tasks at their required frequencies.
Taking into account the type of application, the FTMP developers determined that
tasks had to be executed at three different frequencies, and the type of action performed
by the task determined which rate group the task belonged in. They termed the three
rate groups R1, R3, and R4. Their respective nominal frequencies are 3.125, 12.5, and 25
Hz. Tasks required to execute at a particular frequency are given priority to access
system components over tasks that are initiated at lower frequencies. This implies that
tasks in the R4 rate group have priority for bus access over tasks from rate group R3,
which in turn have priority over tasks from rate group R1.

Fault detection, identification, and system reconfiguration are handled by an
executive program called the System Configuration Controller (SCC), which is dispatched at
the slowest rate, R1. This is done so the execution of the SCC will have a minimal effect
on the system workload, and the errors generated by a single fault will have an
appropriate system response. For experimental purposes, there are two application tasks
installed on the FTMP: auto-pilot and display programs.

The associated fault injection system is controlled by a host VAX-11/750 computer.
The injection extenders can be inserted between any chips at LRU3 and their respective
socket holes such that the electrical connection between pins and the circuit board
becomes controllable. Thus, three types of faulty signals, i.e., inverted signal, stuck-at-1,
and stuck-at-0, can be injected at the pin level. Before any injection, the host computer
will signal the FTMP to activate LRU3 for the fault injection. The detection,
identification and reconfiguration intervals are measured by reading a real-time clock and the
responses from the FTMP. This information is then transferred from the FTMP to the
VAX-11/750 via a 1553 I/O port and a communication interface. Fault injection
operations are processed by the FIS (Fault Injection System) on the VAX-11/750. The FIS
consists of a command interpreter, an injection handler, and an FTMP-VAX interface
program.

Figure 4. A block diagram of FTMP (from [15]).

Recently, we have conducted some experiments on FTMP to measure some factors 
relating to bus contention, and the polling sequence. The results are summarized in 
Table 3. These results pertain to the fault free system with three operating triads. As 
can be seen, with the software presently on the system, there is a large amount of bus 
contention. Although a triad usually succeeds in its first poll sequence, it must wait 
47% of the time for the bus to become free. However, it was noticed in performing the 
measurements that the bus was usually busy for only a very short period. The busy 
period was of a significant duration in only a few instances. It is also interesting to note 
that the duration of a bus transaction is one quarter the time between bus requests. 
This is probably why the bus is busy so often when a bus request is made. 

    P(Bus is busy when a bus request is made)                           = 0.47
    P(Bus is free when a bus request is made)                           = 0.53
    P(Succeed in first poll sequence)                                   = 0.92
    P(Lose first poll sequence)                                         = 0.08
    P(Succeed in second poll sequence)                                  = 1.00
    Ave. idle time waiting for free bus, if lost poll sequence          = 32.2 μs
    Ave. idle time waiting for free bus, if busy when request was made  = 21.0 μs
    Ave. duration of bus transaction                                    = 36.4 μs
    Ave. time between bus requests                                      = 140.9 μs

    Table 3. Experimental Measurements

2.7. Queueing Model Representation of FTMP

It is obvious that the architecture and software structure of FTMP fit nicely into
our queueing model. One can represent the three triads as clusters, and each of the rate
groups as a job class. Job class 1 is rate group R4, because of the relative priorities of
the rate groups and job classes. Likewise, job class 2 is R3, and job class 3 is R1. There
is some dependence when tasks from a rate group are executed based on the state of
tasks of a higher priority rate group. However, these can be handled by the model by
increasing the number of job classes. For the purpose of illustration, these dependencies
are assumed to be negligible. In the queueing model representation of FTMP we,
therefore, have five nodes and three tokens representing clusters, i.e., $n=3$ and $m=3$. By the
formula mentioned earlier, the Markov chain representation of this specific model has
$J = \binom{7}{4} = 35$ states. These states and their respective reduced system states are
described in Table 4.
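
The state count and the numbering used in Table 4 can be reproduced by direct
enumeration. The sketch below is illustrative only; the helper name
`reduced_states` is an assumption, not part of the original analysis. It lists
the reduced system states in descending lexicographic order, which is the order
in which Table 4 lists them, and picks out the states with more than one cluster
at node 1 (bus contention) and with at least one cluster at node 2 (an idle
cluster), following the criteria of Section 2.5.

```python
from math import comb


def reduced_states(m, nodes):
    """All ways to place m identical tokens on `nodes` nodes, in descending
    lexicographic order."""
    if nodes == 1:
        return [(m,)]
    return [(first,) + rest
            for first in range(m, -1, -1)
            for rest in reduced_states(m - first, nodes - 1)]


m, n = 3, 3                       # three clusters, three job classes
states = reduced_states(m, n + 2)
assert len(states) == comb(m + n + 1, n + 1) == 35

# 1-based state numbers.
contention = [k for k, s in enumerate(states, 1) if s[0] > 1]
idle = [k for k, s in enumerate(states, 1) if s[1] >= 1]
print("bus contention states:", contention)
print("idle cluster states:  ", idle)
```

Summing the steady state probabilities over either index list gives the
corresponding performance measure.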

To solve the Markov chain, the values for the parameters of the queueing model
have to be determined. Sample values are outlined in Table 5. The value for $\mu_s$ was
obtained from the experimental data. Pj was arrived at from the documentation on
FTMP [18]. The other parameters were arrived at through reasonable assumptions, or
realistic relations among the service rates. The computed steady state probabilities for
the states of the Markov chain using these parameter values are shown in column 3 of
Table 4. Columns 4, 5, and 6 of Table 4 are the steady state probabilities when the
parameter $\mu_s$ is varied, and the rest of the parameters remain constant.

Using the information supplied by Table 4, some simple results can be stated. The
probability that there is an idle cluster is the sum of the steady state probabilities for
the Markov states where there are one or more clusters at node 2. These are states 2, 6,
7, 8, 9, and 16 through 25. The idle probabilities for the different values of $\mu_s$ are shown in
Table 6. These numbers are extremely low, implying that rarely is a triad idle. The
probability that there is bus contention is the sum of the steady state probabilities of
states 1 through 5 (states representing more than one cluster at node 1). These results are
        Markov States                      Computed Steady State Prob.
State  Reduced System State   μ_s=0.0275  μ_s=0.00275  μ_s=0.0138  μ_s=0.055
  1    (3, 0, 0, 0, 0)        0.022       --           --          0.004
  2    (2, 1, 0, 0, 0)        0           --           --          0
  3    (2, 0, 1, 0, 0)        0.039       0.106        0.087       0.013
  4    (2, 0, 0, 1, 0)        0.039       0.106        0.087       0.013
  5    (2, 0, 0, 0, 1)        0.037       0.099        0.081       0.013
  6    (1, 2, 0, 0, 0)        0           0            0           0
  7    (1, 1, 1, 0, 0)        0.001       0            0.001       0.001
  8    (1, 1, 0, 1, 0)        0.001       0            0.001       0.001
  9    (1, 1, 0, 0, 1)        0.001       0            0.001       0.001
 10    (1, 0, 2, 0, 0)        0.036       0.010        0.039       0.024
 11    (1, 0, 1, 1, 0)        0.071       0.019        0.079       0.048
 12    (1, 0, 1, 0, 1)        0.067       0.018        0.074       0.045
 13    (1, 0, 0, 2, 0)        0.036       0.010        0.039       0.024
 14    (1, 0, 0, 1, 1)        0.068       0.018        0.073       0.045
 15    (1, 0, 0, 0, 2)        0.031       0.008        0.034       0.021
 16    (0, 3, 0, 0, 0)        0           0            0           0
 17    (0, 2, 1, 0, 0)        0           0            0           0
 18    (0, 2, 0, 1, 0)        --          0            0           0
 19    (0, 2, 0, 0, 1)        --          0            0           0
 20    (0, 1, 2, 0, 0)        --          0            0.001       --
 21    (0, 1, 1, 1, 0)        --          0            0.001       --
 22    (0, 1, 1, 0, 1)        --          0            0.001       --
 23    (0, 1, 0, 2, 0)        0.001       0            0.001       0.001
 24    (0, 1, 0, 1, 1)        0.002       0            0.001       0.003
 25    (0, 1, 0, 0, 2)        0.001       0            0.001       0.001
 26    (0, 0, 3, 0, 0)        --          --           --          --
 27    (0, 0, 2, 1, 0)        --          --           --          --
 28    (0, 0, 2, 0, 1)        0.060       --           --          0.081
 29    (0, 0, 1, 2, 0)        0.064       0.002        0.035       0.087
 30    (0, 0, 1, 1, 1)        0.120       0.003        0.066       0.163
 31    (0, 0, 1, 0, 2)        0.056       0.002        0.031       0.076
 32    (0, 0, 0, 3, 0)        0.021       --           0.012       --
 33    (0, 0, 0, 2, 1)        0.060       --           0.033       --
 34    (0, 0, 0, 1, 2)        0.056       --           --          --
 35    (0, 0, 0, 0, 3)        0.018       0            0.010       0.024

(-- : entry not legible)

Table 4. Markov State Descriptions and Steady State Probabilities
    μ_s = 1/36.35        = 0.0275
    μ_I = 5 μ_1          = 0.0458
    μ_1 = (1/3) μ_s      = 9.17 × 10^-3
    μ_2 = (1/6) μ_s      = 4.58 × 10^-3
    μ_3 = (1/16.87) μ_s  = 1.63 × 10^-3

Table 5. Parameter Values
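
The relations among the Table 5 values can be checked with a few lines of
arithmetic. This is a sketch under stated assumptions: the ratios 1/3, 1/6, and
1/16.87 relative to the bus service rate are inferred from the tabulated
numbers, and the bus service rate is rounded to 0.0275 before the other rates
are derived, since that rounding is what reproduces the printed digits.

```python
# Bus service rate from the measured 36.35 microsecond transaction time,
# rounded to four decimals as printed in the table.
mu_s = round(1 / 36.35, 4)

mu_1 = mu_s / 3        # assumed relation: one third of the bus rate
mu_2 = mu_s / 6        # assumed relation: one sixth of the bus rate
mu_3 = mu_s / 16.87    # assumed relation inferred from the printed value
mu_I = 5 * mu_1        # idle-node rate, five times the class-1 rate

for name, value in [("mu_s", mu_s), ("mu_I", mu_I), ("mu_1", mu_1),
                    ("mu_2", mu_2), ("mu_3", mu_3)]:
    print(f"{name} = {value:.3e}")
```

The ordering of the three processing rates agrees with the earlier remark that
the rate typically decreases as the job class index increases.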
Table 6. Idle Processors and Bus Contention Probabilities

also shown in Table 6. Figure 5 shows the effect of changing the service rate at node 1
on bus contention. The probability of bus contention increases dramatically as the
service rate approaches zero, as expected. Figures 6-8 show the effects of varying
and Pj, respectively. One could derive from these graphs the sensitivity of
performance to a change in a service rate or branch probability.

There are numerous other conclusions that could be drawn from the results of this 
example. The sensitivities of varying other parameters, average queueing times, and 
degraded system performance are just a few. It can be seen that this model is useful in 
analyzing many of the aspects that are vital to any performance evaluation. It is impor- 
tant to note that all the parameters of the queueing model are ones that can be meas- 
ured. 

3. MEASUREMENT OF FAULT LATENCY 
3.1. Introduction 

A hardware fault is defined as an incorrect state caused by the physical change in a
component, whereas an error is defined to be the erroneous information/data resulting
from the manifestation of a fault. Even after a hardware fault occurs in a computer
system, the system will remain error-free until the fault manifests itself. Before its
manifestation, the fault is latent and is not harmful to any system operations. Thus, there are
two time intervals of interest between fault occurrence and error detection: fault latency
(from fault occurrence to error generation) and error latency (from error generation to
detection); see [21] for a detailed description of these. Obviously, error latency
depends on the detection mechanisms^5 used. Fault latency is dependent on the location
and the type of the fault, and the degree of usage of the faulty unit. In other words,
fault latency is closely related to the physical property of a fault, whereas error latency
represents the efficiency of the detection mechanisms used.

^5 which we termed the function-level detection mechanisms in [21].

In a reliable computer system, the detection and isolation of faults and errors, and 
the subsequent reconfiguration are provided to tolerate faults and errors. These steps 
must be executed correctly by fault-free subsystems. In the face of multiple faults, the 
fault-tolerance capability is reduced and the coverage of failure is incomplete. It has 
been shown that an incomplete coverage is the major threat to a highly reliable system 
[22-24]. Thus, the accumulation of latent faults and the near-coincident occurrence of
faults should be considered in the modeling and verification of a reliable system. How- 
ever, the conventional modeling of a reliable system usually assumes that the system is 
recovered from an extant fault if no new fault occurs during the recovery period; other- 
wise, a coverage failure results. This is true only when there is no fault latency or a 
negligible fault latency during which no new fault occurs. That is, the conventional 
works have ignored the possibility of the accumulation of latent faults. Obviously, the 
conventional approach becomes invalid if fault latency has the same order of magnitude 
as the recovery period. Due to the reasons discussed above, it is essential to accurately 
evaluate both fault and error latencies. 

In addition to the analysis of the coverage failure, the knowledge of fault latency is 
important to the study of transient faults. Clearly, a transient fault manifests itself only 
when its active duration is greater than fault latency. If fault latency is long, it is possi- 
ble that most transient faults will disappear before they harm the system. In such a case, 
the transient faults captured by some detection mechanisms cannot represent the true 
characteristics of all transient faults. 
In the past, several researchers conducted experiments and simulations to
investigate faults' manifestations and subsequent error detections by injecting hardware faults
[25-34]. Results were observed through the detection mechanisms following the fault
injections. They measured the probability of detection and the distribution of detection 
times which are the sum of fault and error latencies. Since there does not exist a direct 
way to determine the moment of error generation, these experiments fail to indicate the 
moment of error generation which divides the detection time into fault latency and error 
latency. Instead, a combined effect of the inherent fault property and an associated 
detection operation can be observed via these experiments. Thus, these experiments nei- 
ther help us understand the behavior of fault and error generation, nor give an accurate 
measure of the capabilities of detection mechanisms. In order to remove this inade- 
quacy, we develop here a methodology to measure fault latency; with the measured fault 
latency and detection time, error latency can also be computed. 

3.2. Methodology for Measurement of Fault Latency 

Suppose there are some detection mechanisms which are able to detect the error
generated by a fault $f$. Let $t_f$ represent the fault latency of this specific
fault, which is a random variable with the distribution function $F_f(t)$. We
inject the fault $f$ $n_i$ times, and each injection is held active for the
duration $t_i$. If $t_f$ is greater than $t_i$, then no error will be generated.
Otherwise, the fault manifests itself, inducing an error which will be captured
later by the detection mechanisms. If there are $d_i$ detections among these
$n_i$ injections, then the ratio $d_i/n_i$ indicates the probability that an
error is generated during the fault active duration $t_i$. This is equivalent to
the probability that the fault latency is smaller than $t_i$. Thus, we obtain
the distribution function of fault latency for the fault $f$ as follows:

$$F_f(t_i) = \mathrm{Prob}(t_f \le t_i) = \frac{d_i}{n_i} \qquad (1)$$

Notice that this measurement of fault latency is not affected by error latency. This 
also implies that the result of measurement is independent of the efficiency of detection 
mechanisms. Thus, as long as the error induced by the fault / can be detected, we can 
obtain the distribution of fault latency for the fault / . 
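
The estimator in Equation (1) can be illustrated with synthetic data. In the
sketch below every name is hypothetical, and the exponential latency used to
generate the synthetic injections is purely an assumption for demonstration; it
is not a claim about the latency distribution on FTMP. Each synthetic fault
produces an error, and is therefore counted as detected, only if its latency
does not exceed the active duration, regardless of how long detection itself
takes.

```python
import math
import random


def estimate_cdf(trials):
    """trials: iterable of (t_i, n_i, d_i); returns [(t_i, d_i / n_i)]
    sorted by active duration t_i."""
    return [(t, d / n) for t, n, d in sorted(trials)]


def simulate_trials(durations, n, mean_latency, rng):
    """Inject n synthetic faults per active duration and count detections."""
    trials = []
    for t in durations:
        detected = sum(rng.expovariate(1.0 / mean_latency) <= t
                       for _ in range(n))
        trials.append((t, n, detected))
    return trials


rng = random.Random(1)
trials = simulate_trials([0.1, 0.5, 1.0, 2.0, 5.0], 4000, 1.0, rng)
for t, f_hat in estimate_cdf(trials):
    print(f"t = {t:4.1f}   d/n = {f_hat:.3f}   true F = {1 - math.exp(-t):.3f}")
```

Because only the count of detections enters the estimate, adding any detection
delay to the simulation would leave the printed ratios unchanged, which is the
independence from error latency noted above.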

We cannot overemphasize the fact that the moment of error generation is not 
directly observable. Although the occurrence of a logic failure caused by a fault can be 
identified by voting, the logic failure does not always induce an error at the function 
level. In other words, there may not exist a sensitized path such that the faulty signal 
can propagate to the output stage. Consequently, we have proposed a new methodology 
to indirectly measure the fault latency. Due to the “indirect” nature of our measure- 
ment, we obtain the distribution of fault latency instead of actual samples of fault 
latency. Clearly, this fact does not allow for any rigorous statistical analysis of our 
experimental data. However, to our best knowledge, the proposed indirect methodology 
is the first and the only attempt to measure fault latency. 

3.3. Experimental Results and Analysis on FTMP 

For our experiments, the original FIS (Fault Injection System) has been modified to
enable us to inject transient faults.^6 Additional features are added to the command
interpreter such that the active duration of a transient fault can be specified and passed
to the injection handler. Injection ends if either the response of the FTMP indicates the
accomplishment of detection, identification and reconfiguration, or the active duration
becomes larger than the specified value. In the latter case, FIS is made to wait a few
seconds for a possible response from the FTMP.

^6 The original FIS is designed for injecting permanent faults only.

To measure fault latency and demonstrate the methodology proposed above,
transient faults were injected into four circuit boards of the FTMP, i.e., CPU Data Path, CPU
Control Path, Cache Controller, and System Bus Controller. The first three boards are
in the CAP6 processor/cache region which is constructed with the AMD 2900 series bit- 
slice microprocessors. The System Bus Controller is responsible for transferring blocks of 
words between a local processor region and the shared memory. It also serves as a syn- 
chronizing mechanism such that the processors in a triad can be brought into full syn- 
chrony. On each board, several pins are selected for injecting transient faults. Selection 
of boards and pins is made arbitrarily. For each pin, stuck-at-0, stuck-at-1, and 
inverted signals are injected. 

A prime test was applied to each selected pin to observe whether or not an error is
generated after the injection of a permanent fault (which has an active duration of 3
seconds or more). In Wimmergren's experiments on the FTMP [33], undetected faults are
reported to exist. Possible explanations for the existence of undetected faults are: (1) the
circuits are not exercised, (2) there are "don't care" or redundant pins, and (3) the
injected fault does not cause any logic failure. In our experiments, injection of transient
faults is not made if there is no detection during the prime test. At certain pins, errors
are detected when stuck-at-0 and inverted signal faults are injected, but not stuck-at-1
faults. In such a case, injection of stuck-at-1 faults is omitted.^7

For each pin, transient faults with different active durations are injected 10 to 40
times repeatedly. In an early experiment, we found that $d_i/n_i$ increases sharply when
the transient durations are small. Thus, to have good resolution, the active durations of
the transient faults injected, denoted by $t_i$, are not equally spaced. That is, we used a
finer resolution for small $t_i$'s and a coarser resolution for large $t_i$'s. Moreover, since the
fault latency at the System Bus Controller board is much larger than that at the other
boards, the $t_i$'s used for testing this board are different from those used for the others.

^7 Obviously, there is no use of such an injection.

Among more than 20,000 transient faults injected, only 15,111 results are used for
the analysis. The other data are regarded as unreliable because: (1) the fault identified by
the FTMP was not in the LRU where the fault was actually injected, (2) the FTMP
crashed during the fault injection, or (3) one of the detection, identification and
reconfiguration times was negative. If the second case occurred, the injection was performed
again. For every $i$ and each type of fault at a pin, using the measured $d_i/n_i$, we
obtained the averaged $d_i/n_i = F_f(t_i)$ for each board, which are listed in Table 7. In
addition, we present $h_f(t_i)$ in the table, which is defined as

$$h_f(t_i) = \frac{F_f(t_{i+1}) - F_f(t_i)}{(t_{i+1} - t_i)\,(1 - F_f(t_i))} \qquad (2)$$

The function $h_f(t_i)$ becomes the hazard rate of fault latency as $t_{i+1} - t_i \to 0$.
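
A short sketch of this discrete hazard estimate follows. The function name is
an assumption, and the sample values are the first fault-type column for the
Cache Controller as far as it can be read in Table 7(a) (taken here to be the
stuck-at-0 column, which is an assumption about the column order).

```python
def discrete_hazard(t, F):
    """h(t_i) = (F(t_{i+1}) - F(t_i)) / ((t_{i+1} - t_i) * (1 - F(t_i)))
    for consecutive sample points; None where F(t_i) has already reached 1."""
    h = []
    for i in range(len(t) - 1):
        survival = 1.0 - F[i]
        if survival <= 0.0:
            h.append(None)  # hazard is undefined once no faults remain latent
        else:
            h.append((F[i + 1] - F[i]) / ((t[i + 1] - t[i]) * survival))
    return h


t = [0.0, 0.01, 0.10, 0.50]   # active durations
F = [0.0, 0.21, 0.30, 0.32]   # measured d_i / n_i
print([round(x, 3) for x in discrete_hazard(t, F)])
```

A negative value appears whenever the measured ratio drops from one duration to
the next, which explains how sampling noise can produce the negative table
entries mentioned below.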

Despite the fact that negative numbers appeared twice in Table 7, the functions $h_f(t_i)$
in the table strongly suggest that the hazard rate of fault latency is monotone
decreasing. Thus, two distributions which have monotone decreasing hazard rates when
the shape parameter is less than one, i.e., Weibull and Gamma distributions, are used to
fit the experimental results. Estimated parameters are given in Table 8, where the
least-squares errors are also included. The experimental results and the estimated
Weibull distribution are plotted in Figures 9 through 12.
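
One common way to obtain a least-squares Weibull fit from points on a
distribution function is to linearise the distribution and apply ordinary linear
regression. The report does not state which least-squares procedure was used,
so the sketch below is an assumed approach, with an assumed parameterisation
$F(t) = 1 - \exp(-(\lambda t)^{\alpha})$ and a hypothetical function name. The
data are synthetic and generated from known parameters.

```python
import math


def fit_weibull(t, F):
    """Least-squares fit of F(t) = 1 - exp(-(lam * t) ** alpha) on the
    linearised scale  ln(-ln(1 - F)) = alpha * ln(t) + alpha * ln(lam).
    Points with t <= 0 or F outside (0, 1) cannot be linearised and are skipped.
    """
    pts = [(math.log(ti), math.log(-math.log(1.0 - Fi)))
           for ti, Fi in zip(t, F) if ti > 0 and 0.0 < Fi < 1.0]
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    alpha = sxy / sxx                       # slope gives the shape parameter
    lam = math.exp((my - alpha * mx) / alpha)
    return lam, alpha


# Synthetic points from a Weibull with 1/lam = 4 and alpha = 0.5.
t = [0.01, 0.1, 0.5, 2.0, 10.0]
F = [1 - math.exp(-((0.25 * x) ** 0.5)) for x in t]
lam, alpha = fit_weibull(t, F)
print(f"1/lam = {1 / lam:.3f}, alpha = {alpha:.3f}")
```

A fitted shape parameter below one corresponds to a decreasing hazard rate.
Fitting on the linearised scale weights the points differently from a direct fit
to the distribution function, so the two procedures need not return identical
parameters on noisy data.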

The estimated parameter for exponential distributions is also presented in Table 8 
for the purpose of comparison with Weibull and Gamma distributions. It can be seen 
            stuck-at-0            stuck-at-1            inverted
    t_i     F_f(t_i)  h_f(t_i)    F_f(t_i)  h_f(t_i)    F_f(t_i)  h_f(t_i)
    0.0     0.0       21.0        0.0       9.0         0.0       35.0
    0.01    0.21      1.27        0.09      0.36        0.35      3.59
    0.10    0.30      0.071       0.12      0.20        0.56      0.40
    0.50    0.32      0.23        0.19      0.074       0.63      --

(-- : entry not legible; later rows are not recoverable)

(a). Experimental Results and h_f(t_i) on Cache Controller.

    Legible entries in reading order (column alignment lost):
    0.94, 1.67, 0.98, -3.75, -, 0.50, 0.98, 2.00, 0.11, -, 10.00, 0.98, 0.01,
    1.00, -, -, 20.00, 1.00, -, 1.00

(b). Experimental Results and h_f(t_i) on CPU Control Path.

    Legible entries: 0.83

(c). Experimental Results and h_f(t_i) on CPU Data Path.

(d). Experimental Results and h_f(t_i) on System Bus Controller.

Table 7. Experimental Results and Estimated h_f(t_i).











Figure 9. The Experimental Results and Estimated Distributions for Stuck-at-0
Faults.

Figure 10. The Experimental Results and Estimated Distributions for Stuck-at-1
Faults.

Figure 11. The Experimental Results and Estimated Distributions for Inverted
Signal Faults.

Figure 12. The Experimental Results and Estimated Distributions of Fault
Latencies at System Bus Controller.
                   Exponential        Weibull                    Gamma
Board  Fault       1/λ      error     1/λ      α      error      1/λ      α      error
CC     s-a-0       4.78     0.24      4.35     0.35   0.03       45.89    0.24   0.02
       s-a-1       13.07    0.08      15.24    0.51   0.015      61.61    0.38   0.006
       inverted    0.46     0.41      0.56     0.20   0.009      82.90    0.11   0.007
CPUC   s-a-0       0.009    0.004     0.0076   0.39   0.0006     0.117    0.19   0.0008
       s-a-1       0.005    0.003     0.001    0.27   0.0025     0.092    0.09   0.0029
CPUD   s-a-0       0.515    0.539     1.488    0.21   0.021      153.9    0.12   0.018
       s-a-1       0.628    0.31      0.799    0.23   0.006      56.79    0.13   0.0013
       inverted    0.036    0.115     0.030    0.29   0.0026     0.648    0.18   0.032
SBC    s-a-0       125.2    0.063     124.9    0.89   0.061      173.2    0.77   0.057
       s-a-1       46.9     0.097     54.85    0.58   0.020      176.18   0.44   0.021
       inverted    34.4     0.029     39.10    0.70   0.0045     80.44    0.58   0.0066

CC -- Cache Controller, CPUC -- CPU Control Path
CPUD -- CPU Data Path, SBC -- System Bus Controller

Table 8. Least-Squares Estimation of the Distributions of Fault Latencies.
that the constant error generation rate (i.e., exponential distribution) does not model the
error generation well. The mean fault latencies, which are $1/\lambda$ in the estimated
parameter of the exponential distribution, range from 0.0005 ms for stuck-at-1 faults in
the CPU Control Path to 125 ms for stuck-at-0 faults in the System Bus Controller. This
is due to the different exercise rates at each board. Since each injected stuck-at-0 or
stuck-at-1 fault does not always represent a logic failure at the moment of injection, the
fault with an inverted signal should have a shorter fault latency: this is confirmed by the
experimental results.

As pointed out earlier, fault latency is not directly observable. This fact has led us
to develop a new methodology which allows for indirect measurement of fault latency.
Note, however, that our experimental results give the distribution function of fault
latency instead of data samples of fault latency. Hence, statistical analyses or
hypothesis testing are not applicable to these experimental data. The least-squares
estimation with commonly used distributions gives only approximate values of the
parameters; it cannot test whether an underlying model is (statistically) good or bad.
Indeed, from the least-squares errors in Table 8 it is unclear which distribution has
the best fit. However, since the hazard rate converges to 1/λ and 0 for Gamma and
Weibull distributions, respectively, it is possible to distinguish between them once
additional injections with larger active durations are performed.
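
The different tail behavior of the two hazard rates can be checked numerically. The
sketch below is illustrative only. It uses a hypothetical scale of 100 ms rather than
any value from Table 8, and it fixes the shape at 1/2 for both distributions, because
the Gamma survival function then has a closed form through the complementary error
function. How `scale` and `shape` map onto the report's λ and α is an assumption. Under
these assumptions the Weibull hazard keeps decaying toward 0, while the Gamma hazard
levels off at a positive constant set by the scale.

```python
import math


def weibull_hazard(t, scale, shape):
    """Hazard of a Weibull with survival function exp(-(t/scale)**shape)."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


def gamma_half_hazard(t, scale):
    """Hazard of a Gamma distribution with shape 1/2 and the given scale.

    For shape 1/2 the survival function is erfc(sqrt(t/scale)), so no
    incomplete-gamma routine is needed.
    """
    x = t / scale
    density = math.exp(-x) / (scale * math.sqrt(math.pi * x))
    survival = math.erfc(math.sqrt(x))
    return density / survival


scale = 100.0  # hypothetical scale in ms; not taken from Table 8
times = [100.0, 1000.0, 10000.0]
weibull = [weibull_hazard(t, scale, 0.5) for t in times]
gamma = [gamma_half_hazard(t, scale) for t in times]

for t, hw, hg in zip(times, weibull, gamma):
    print(f"t = {t:7.0f} ms   Weibull hazard = {hw:.5f}   Gamma hazard = {hg:.5f}")
```

The two curves separate only at times well beyond the scale, which is consistent with
the need for injections with larger active durations.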

4. CONCLUSION AND DISCUSSION 

In this report, we have first presented a model for studying the effects of workload
on performance for a highly reliable unibus multiprocessor used in critical real-time
applications. Because of the strict performance criteria required for systems of this
type, a detailed analysis is both desirable and necessary.

The operation of the computing system addressed has been illustrated using a
modified Stochastic Petri Net (SPN). The purpose of this model was to describe
graphically the synchronous operation of multiple processing clusters and to show
which aspects of the computer’s operation have the most significant effect on its
performance. System bus contention, workload distribution, and idle processing
periods clearly have a marked effect on performance.

The modified SPN was useful for describing computer activity. However, as a tool
for performance evaluation, it was shown to be too complex to analyze usefully. A
simpler model has been presented that still describes the critical performance-related
facets. This model is a closed queueing network consisting of multiserver nodes and a
single non-preemptive priority queue.

The queueing model was shown to be easily solved for a set of given parameters. It 
was also observed that useful results pertaining to system performance could be directly 
obtained from the solution to the queueing model. The ease of obtaining these results 
and the overall importance of the results demonstrate the usefulness of the model for the 
purpose of performance evaluation. 

One area that merits further research is determining the distribution of the
workload among different job classes. A systematic method has not yet been developed
to construct the various job classes from the workload of a real-time control system.
Characterizing real-time workloads is a more restricted problem than characterizing
the workloads of a general-purpose computer, which motivates continued research on
the workload distribution problem. Once a characterization method is developed,
one can then consider obtaining a workload distribution that optimizes performance.


We have also developed a new methodology for indirectly measuring fault latency
through the injection of faults. The methodology has been realized by experiments on
the FTMP. The FTMP experimental results show a large variation in fault latencies for
different circuits. It has also been observed that the hazard rate of fault latency is
monotone decreasing. This implies that a fault tends to remain latent if it did not
generate an error at an early stage. The existence of a long fault latency should not
be ignored in highly reliable systems. To reduce the accumulation of latent faults,
additional on-line diagnostics must be incorporated into the areas where a long fault
latency exists.⁸
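
The effect of a decreasing hazard rate can be made concrete with a conditional
probability: given that a fault has already stayed latent for some time, how likely is
it to stay latent for a further interval? The sketch below uses the Weibull estimates
listed in Table 8 for stuck-at-0 faults in the System Bus Controller. Reading them as
scale = 1/λ in ms and shape = α, with survival function exp(-(t/scale)^shape), is an
assumption about the report's parameterization. The function names are illustrative.

```python
import math


def weibull_survival(t, scale, shape):
    """P(latency > t) for a Weibull with survival exp(-(t/scale)**shape)."""
    return math.exp(-((t / scale) ** shape))


def still_latent(t, extra, scale, shape):
    """P(latency > t + extra | latency > t)."""
    return weibull_survival(t + extra, scale, shape) / weibull_survival(t, scale, shape)


# Table 8, SBC s-a-0, Weibull columns (parameterization assumed, see above).
scale, shape = 124.9, 0.89
ages = [0.0, 100.0, 1000.0, 2000.0]  # time already spent latent, in ms
probs = [still_latent(age, 100.0, scale, shape) for age in ages]

for age, p in zip(ages, probs):
    print(f"latent for {age:6.0f} ms -> P(latent 100 ms more) = {p:.3f}")
```

With a shape below 1 the probability rises with the age of the fault, whereas for an
exponential distribution (shape equal to 1) it would be the same at every age.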

Although two possible distributions are used to fit the experimental results, no
underlying model for fault latency can be concluded, mainly because error generation
is unobservable. More experiments should be designed to investigate the behavior of a
fault and its effect on system execution. An immediate extension of our experiments is
to make the injections under different system workloads or during the execution of
different application tasks. We expect to see some variation of fault latency in
certain circuits.

During the FTMP experiments, some interesting points were observed, especially
when the faults were injected into the System Bus Controller. At certain pins, the
identification results were different for various active durations of injections. For
instance, with a long active duration (relative to the fault latency), the SCC
indicated that the whole LRU was faulty, but it indicated that only a processor or
memory was faulty when the active duration was short. This situation was sometimes
reversed. In other words, the identification results of the SCC depend on both the
location of injection and the active duration of the fault. For the injections in the
other boards, e.g., the Cache Controller and the CPU Data and Control Paths, a
processor was identified as faulty. Note that the System Bus Controller is the
interface between the processor region and the system buses. These observations show
that the errors do not propagate out of the processor boundary. They also suggest that
an error easily propagates from interface circuits, but the identification of a
faulty interface circuit is more difficult.

⁸Such areas can be identified by the methodology proposed in this report.

In addition, we encountered several problems that were inconsistent with the
FTMP’s specification, which forced us to abandon some experimental results.
Specifically, fault injections into the System Bus Controller caused the FTMP to
crash frequently or to make wrong identifications. Certainly, the FTMP could not
distinguish between the injection of a fault and the true occurrence of a fault. These
abnormalities occurred too frequently to be treated as random failures. Moreover, only
210 responses from the FTMP indicated that the detected faults were transient, even
when faults with a 10-microsecond active duration were injected. In fact, all
injections of transient faults in the Cache Controller and the CPU Data and Control
Paths were regarded as permanent. A thorough verification of the FTMP’s detection and
identification mechanisms is needed. This is a matter for our future research.